Data Analysis and Visualization On Pre-Owned Cars¶

mainheader.jpg

Introduction¶

Here is an outline of the steps we'll follow:

  • Downloading a dataset from an online source.
  • Obtain and understand the dataset
  • Clean, prepare and optimise the data
  • Open-ended exploratory analysis and visualization.
  • summarizing inferences and conclusion.

In this project, we will use the 'Used Cars Dataset' Austin Reese, available on Kaggle scraped from the classified advertisements website Craigslist.

The dataset contains one csv file named vehicles.csv with a size of 1.45 GB.

Before we dive into our exercise, Let's look at the columns we are going to analyze.

The dataset contains a total of 26 columns, listed below are the 17 columns are used to our analysis.

  1. region: Region from where the listing is made.
  2. price:Asking price for the vehicle in the listing.
  3. year:Year of registration of the vehicle listed.
  4. manufacturer: Make of the vehicle listed.
  5. model: Model name of the vehicle listed.
  6. condition: Condition of the vehicle listed.
  7. cylinders:Engine size, based on the number of cylinders it has.
  8. odometer: The number of miles on the odometer of the vehicle.
  9. title_status: Contains the title status of the vehicle. Vehicle titles are certificates for legal ownership of a vehicle.
  10. transmission:The type of transmission on the vehicle.
  11. drive: Contains information about how the drive train delivers its power eg. AWD, FWD, RWD etc.
  12. size: Which size category the vehicle falls in.
  13. type Separates the vehicles on the basis of their type, eg. Hatchback, Pickup, Sedan etc.
  14. lat: Latitude of from where the listing is made.
  15. long: Longitude of from where the listing is made.
  16. posting_date: Date of when the listing was made.
  17. state: State code of where the listing is made.

Data Pre-processing¶

In this section, the chosen dataset is downloaded from Kaggle. This is done using the opendatasets library. On obtaining the dataset we read it using pandas and study the thus obtained pandas dataframe. Some of the important columns, that we plan to use in our analysis, are selected whereas the others are dropped so that the execution times are reduced.

Lets Install Required Packages¶

In [1]:
pip install pyyaml==5.4.1
Requirement already satisfied: pyyaml==5.4.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (5.4.1)
Note: you may need to restart the kernel to use updated packages.
[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip
In [2]:
pip install numpy pandas-profiling matplotlib seaborn  folium opendatasets geopandas --upgrade 
Requirement already satisfied: numpy in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (1.23.5)
Requirement already satisfied: pandas-profiling in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (3.5.0)
Requirement already satisfied: matplotlib in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (3.6.2)
Requirement already satisfied: seaborn in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (0.12.1)
Requirement already satisfied: folium in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (0.13.0)
Requirement already satisfied: opendatasets in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (0.1.22)
Requirement already satisfied: geopandas in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (0.12.1)
Requirement already satisfied: pydantic<1.11,>=1.8.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (1.10.2)
Requirement already satisfied: jinja2<3.2,>=2.11.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (3.1.2)
Requirement already satisfied: PyYAML<6.1,>=5.0.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (5.4.1)
Requirement already satisfied: multimethod<1.10,>=1.4 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (1.9)
Requirement already satisfied: typeguard<2.14,>=2.13.2 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (2.13.3)
Requirement already satisfied: scipy<1.10,>=1.4.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (1.9.3)
Requirement already satisfied: htmlmin==0.1.12 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (0.1.12)
Requirement already satisfied: tqdm<4.65,>=4.48.2 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (4.64.1)
Requirement already satisfied: requests<2.29,>=2.24.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (2.28.1)
Requirement already satisfied: statsmodels<0.14,>=0.13.2 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (0.13.2)
Requirement already satisfied: phik<0.13,>=0.11.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (0.12.2)
Requirement already satisfied: pandas!=1.4.0,<1.6,>1.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (1.5.0)
Requirement already satisfied: visions[type_image_path]==0.7.5 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas-profiling) (0.7.5)
Requirement already satisfied: networkx>=2.4 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from visions[type_image_path]==0.7.5->pandas-profiling) (2.8.7)
Requirement already satisfied: tangled-up-in-unicode>=0.0.4 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from visions[type_image_path]==0.7.5->pandas-profiling) (0.2.0)
Requirement already satisfied: attrs>=19.3.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from visions[type_image_path]==0.7.5->pandas-profiling) (21.4.0)
Requirement already satisfied: Pillow in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from visions[type_image_path]==0.7.5->pandas-profiling) (9.2.0)
Requirement already satisfied: imagehash in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from visions[type_image_path]==0.7.5->pandas-profiling) (4.3.1)
Requirement already satisfied: pyparsing>=2.2.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (4.37.3)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (1.0.5)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (21.3)
Requirement already satisfied: cycler>=0.10 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: branca>=0.3.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from folium) (0.5.0)
Requirement already satisfied: click in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from opendatasets) (8.1.3)
Requirement already satisfied: kaggle in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from opendatasets) (1.5.12)
Requirement already satisfied: pyproj>=2.6.1.post1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from geopandas) (3.4.0)
Requirement already satisfied: shapely>=1.7 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from geopandas) (1.8.5.post1)
Requirement already satisfied: fiona>=1.8 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from geopandas) (1.8.22)
Requirement already satisfied: certifi in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (2022.9.24)
Requirement already satisfied: munch in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (2.5.0)
Requirement already satisfied: six>=1.7 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (1.16.0)
Requirement already satisfied: setuptools in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (63.2.0)
Requirement already satisfied: cligj>=0.5 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (0.7.2)
Requirement already satisfied: click-plugins>=1.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from fiona>=1.8->geopandas) (1.1.1)
Requirement already satisfied: colorama in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from click->opendatasets) (0.4.5)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from jinja2<3.2,>=2.11.1->pandas-profiling) (2.1.1)
Requirement already satisfied: pytz>=2020.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pandas!=1.4.0,<1.6,>1.1->pandas-profiling) (2022.2.1)
Requirement already satisfied: joblib>=0.14.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from phik<0.13,>=0.11.1->pandas-profiling) (1.2.0)
Requirement already satisfied: typing-extensions>=4.1.0 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from pydantic<1.11,>=1.8.1->pandas-profiling) (4.4.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from requests<2.29,>=2.24.0->pandas-profiling) (1.26.12)
Requirement already satisfied: charset-normalizer<3,>=2 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from requests<2.29,>=2.24.0->pandas-profiling) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from requests<2.29,>=2.24.0->pandas-profiling) (3.4)
Requirement already satisfied: patsy>=0.5.2 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from statsmodels<0.14,>=0.13.2->pandas-profiling) (0.5.3)
Requirement already satisfied: python-slugify in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from kaggle->opendatasets) (6.1.2)
Requirement already satisfied: PyWavelets in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from imagehash->visions[type_image_path]==0.7.5->pandas-profiling) (1.4.1)
Requirement already satisfied: text-unidecode>=1.3 in c:\users\offic\appdata\local\programs\python\python310\lib\site-packages (from python-slugify->kaggle->opendatasets) (1.3)
Note: you may need to restart the kernel to use updated packages.
[notice] A new release of pip available: 22.2.2 -> 22.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip
In [3]:
pip install plotly==5.3.1
^C
Note: you may need to restart the kernel to use updated packages.
In [2]:
#import libraries
import plotly.express as px
import opendatasets as od
import pandas as pd
In [2]:
#First Go To Kraggle and Signin and create new api and keep the api.json in the project folder
#Download the dataset from kaggle by using link
used_cars_url="https://www.kaggle.com/austinreese/craigslist-carstrucks-data"
#downloads the dataset associated with the link provided
od.download(used_cars_url)
Skipping, found downloaded files in ".\craigslist-carstrucks-data" (use force=True to force download)
In [3]:
vehicles_path="craigslist-carstrucks-data/vehicles.csv"
#reads the csv file as pandas dataframe
vehicles_df=pd.read_csv(vehicles_path)

#To check the csv
vehicles_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 26 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   url           426880 non-null  object 
 2   region        426880 non-null  object 
 3   region_url    426880 non-null  object 
 4   price         426880 non-null  int64  
 5   year          425675 non-null  float64
 6   manufacturer  409234 non-null  object 
 7   model         421603 non-null  object 
 8   condition     252776 non-null  object 
 9   cylinders     249202 non-null  object 
 10  fuel          423867 non-null  object 
 11  odometer      422480 non-null  float64
 12  title_status  418638 non-null  object 
 13  transmission  424324 non-null  object 
 14  VIN           265838 non-null  object 
 15  drive         296313 non-null  object 
 16  size          120519 non-null  object 
 17  type          334022 non-null  object 
 18  paint_color   296677 non-null  object 
 19  image_url     426812 non-null  object 
 20  description   426810 non-null  object 
 21  county        0 non-null       float64
 22  state         426880 non-null  object 
 23  lat           420331 non-null  float64
 24  long          420331 non-null  float64
 25  posting_date  426812 non-null  object 
dtypes: float64(5), int64(2), object(19)
memory usage: 84.7+ MB

Data Preparation and Cleaning¶

  • Handling missing data,

  • Handling redundant data,

  • Handling inconsistent data,

  • Remove Unnecessary data.

In [5]:
#list of columns which can be useful for our analysis..

selected_cols=['region','price','year','manufacturer','model','condition',
                'cylinders','fuel','odometer','title_status','transmission'
                ,'drive','size','type','lat','long','posting_date','state']
In [6]:
#With Selected Columns we create another dataframe ..

df_selected =pd.read_csv(vehicles_path, usecols=selected_cols)

About Our Data¶

The dataset contains one csv file named vehicles.csv with a size of 1.45 GB. The original dataset has 26 columns but We're selecting 18 columns for this excercise.

Let's view some basic information about the data frame.¶

In [7]:
df_selected.shape
Out[7]:
(426880, 18)

Dataframe has just over 0.42 millions of data records in 18 columns.¶

  • Now, let's check if there are any duplicate records in the dataframe. If so, let's delete it.
In [8]:
df_selected.duplicated().sum() #checking how many duplicates in total
Out[8]:
37
In [9]:
df_selected.drop_duplicates() #droping duplicate from the data frame
Out[9]:
region price year manufacturer model condition cylinders fuel odometer title_status transmission drive size type state lat long posting_date
0 prescott 6000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN az NaN NaN NaN
1 fayetteville 11900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ar NaN NaN NaN
2 florida keys 21000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN fl NaN NaN NaN
3 worcester / central MA 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ma NaN NaN NaN
4 greensboro 4900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN nc NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
426875 wyoming 23590 2019.0 nissan maxima s sedan 4d good 6 cylinders gas 32226.0 clean other fwd NaN sedan wy 33.786500 -84.445400 2021-04-04T03:21:31-0600
426876 wyoming 30590 2020.0 volvo s60 t5 momentum sedan 4d good NaN gas 12029.0 clean other fwd NaN sedan wy 33.786500 -84.445400 2021-04-04T03:21:29-0600
426877 wyoming 34990 2020.0 cadillac xt4 sport suv 4d good NaN diesel 4174.0 clean other NaN NaN hatchback wy 33.779214 -84.411811 2021-04-04T03:21:17-0600
426878 wyoming 28990 2018.0 lexus es 350 sedan 4d good 6 cylinders gas 30112.0 clean other fwd NaN sedan wy 33.786500 -84.445400 2021-04-04T03:21:11-0600
426879 wyoming 30590 2019.0 bmw 4 series 430i gran coupe good NaN gas 22716.0 clean other rwd NaN coupe wy 33.779214 -84.411811 2021-04-04T03:21:07-0600

426843 rows × 18 columns

In [10]:
#Checking Data Once again

df_selected.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   region        426880 non-null  object 
 1   price         426880 non-null  int64  
 2   year          425675 non-null  float64
 3   manufacturer  409234 non-null  object 
 4   model         421603 non-null  object 
 5   condition     252776 non-null  object 
 6   cylinders     249202 non-null  object 
 7   fuel          423867 non-null  object 
 8   odometer      422480 non-null  float64
 9   title_status  418638 non-null  object 
 10  transmission  424324 non-null  object 
 11  drive         296313 non-null  object 
 12  size          120519 non-null  object 
 13  type          334022 non-null  object 
 14  state         426880 non-null  object 
 15  lat           420331 non-null  float64
 16  long          420331 non-null  float64
 17  posting_date  426812 non-null  object 
dtypes: float64(4), int64(1), object(13)
memory usage: 58.6+ MB

Let's check for missing values using df.isna() funtion. This function detect's missing values and .sum() returns the sum of the values over the requested columns.¶

In [11]:
df_selected.isna().sum()
Out[11]:
region               0
price                0
year              1205
manufacturer     17646
model             5277
condition       174104
cylinders       177678
fuel              3013
odometer          4400
title_status      8242
transmission      2556
drive           130567
size            306361
type             92858
state                0
lat               6549
long              6549
posting_date        68
dtype: int64
In [12]:
df_selected['price']= df_selected[df_selected['price'] < 40000]['price']

Let's convert above table into percentages and make a bar graph¶

In [13]:
missing_percentages = df_selected.isna().sum().sort_values(ascending = False) / len(df_selected)
missing_percentages[missing_percentages != 0]
missing_percentages[missing_percentages != 0].plot(kind='barh');

There are more than 10 columns with the missing values. Let's analyze each column and decide how we are going to proceed with the missing values.¶

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------¶

Size :¶

This column is missing around 300 K data records but we cannot assign random values to it. We will replace missing values with¶

'unknown' value using fillna function.¶

In [14]:
df_selected['size'] = df_selected['size'].fillna('unknown')
df_selected.rename(columns = {'size':'carsize'}, inplace = True)

Cylinders :¶

This column is missing around 170 K data records but we cannot assign random values to it. We will replace missing values with¶

'other' value using fillna function.¶

In [15]:
df_selected['cylinders'].value_counts()
Out[15]:
6 cylinders     94169
4 cylinders     77642
8 cylinders     72062
5 cylinders      1712
10 cylinders     1455
other            1298
3 cylinders       655
12 cylinders      209
Name: cylinders, dtype: int64
In [16]:
df_selected['cylinders'] = df_selected['cylinders'].fillna('12 cylinders')

Condition :¶

This column is missing around 170 K data records and almost 120k values are having good status , so we are assigning¶

missing one's to good.¶

In [17]:
df_selected['condition'].value_counts()
Out[17]:
good         121456
excellent    101467
like new      21178
fair           6769
new            1305
salvage         601
Name: condition, dtype: int64
In [18]:
df_selected['condition'] = df_selected['condition'].fillna('good')

Drive :¶

This column is missing around 130 K data records but we cannot assign random values to it. We will replace missing values with¶

'unknown' value using fillna function.¶

In [19]:
df_selected['drive'].value_counts()
Out[19]:
4wd    131904
fwd    105517
rwd     58892
Name: drive, dtype: int64
In [20]:
df_selected['drive'] =  df_selected['drive'].replace(['4wd'],'awd')
df_selected['drive'] = df_selected['drive'].fillna('rwd')

Type :¶

This column is missing around 90 K data records but we cannot assign random values to it. We will replace missing values with¶

'unknown' value using fillna function.¶

In [21]:
df_selected['type'] = df_selected['type'].fillna('hardtop')
df_selected['type'] = df_selected['type'].replace(['pickup'],'convertible')

manufacturer :¶

This column is missing around 17 K data records but we cannot assign random values to it. We will replace missing values with¶

'unknown' value using fillna function.¶

In [22]:
df_selected['manufacturer'] = df_selected['manufacturer'].fillna('unknown')

title-status :¶

This column is missing around 8 K data records and almost 400k values are having clean status , so we are assigning missing¶

one's to clean.¶

In [23]:
df_selected.title_status.value_counts()
Out[23]:
clean         405117
rebuilt         7219
salvage         3868
lien            1422
missing          814
parts only       198
Name: title_status, dtype: int64
In [24]:
df_selected['title_status'] = df_selected['title_status'].fillna('clean')

Latitude :¶

This column is missing around 8 K data records but we cannot assign random values to it. We will replace missing values with¶

0 value using fillna function.¶

In [80]:
df_selected['lat'] = df_selected['lat'].fillna(33.669601)

Longitude :¶

This column is missing around 8 K data records but we cannot assign random values to it. We will replace missing values with¶

0 value using fillna function.¶

In [81]:
df_selected['long'] = df_selected['long'].fillna(-86.817617)

model :¶

This column is missing around 5 K data records but we cannot assign random values to it. We will replace missing values with¶

'unknown' value using fillna function.¶

In [82]:
df_selected['model'] = df_selected['model'].fillna('unknown')

odometer :¶

This column is missing around 4.4 K data records but we cannot assign random values to it. We will replace missing values with¶

0 value using fillna function.¶

In [83]:
df_selected['odometer'] = df_selected['odometer'].fillna(0)

fuel :¶

This column is missing around 3 K data records and almost 300k values are having gas, so we are assigning missing¶

one's to gas.¶

In [84]:
df_selected['fuel'].value_counts()
Out[84]:
gas         359222
other        30728
diesel       30062
hybrid        5170
electric      1698
Name: fuel, dtype: int64
In [85]:
df_selected['fuel'] = df_selected['fuel'].fillna('gas')

transmission :¶

This column is missing around 2 K data records but we cannot assign random values to it. We will replace missing values with¶

'other' value using fillna function.¶

In [86]:
df_selected['transmission'] = df_selected['transmission'].fillna('other')

Data Analysis & Visualization¶

Univariate Analysis¶

In [87]:
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.pyplot import xticks
sns.set_style("darkgrid")

Let's see average car price of each company and using that Cars.¶

In [156]:
df_comp_avg_price = df_selected[['manufacturer','price']].groupby("manufacturer", as_index = False).mean().rename(columns={'price':'brand_avg_price'})
plt1 = df_comp_avg_price.plot(x = 'manufacturer', kind='bar',legend = False, sort_columns = True, figsize = (15,5))
plt1.set_xlabel("Brand")
plt1.set_ylabel("Avg Price ($)")
plt1.invert_xaxis()
xticks(rotation = 90)
plt.show()
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\2369741136.py:2: FutureWarning:

`sort_columns` is deprecated and will be removed in a future version.

Lets Check Top Most Manufactures¶

In [157]:
df_selected['manufacturer'] = df_selected['manufacturer'].replace(['ram'],'ram trucks')
largetosmall = df_selected.groupby('manufacturer').size().sort_values().index[::-1]
print(largetosmall)
Index(['ford', 'chevrolet', 'toyota', 'honda', 'nissan', 'jeep', 'ram trucks',
       'unknown', 'gmc', 'bmw', 'dodge', 'mercedes-benz', 'hyundai', 'subaru',
       'volkswagen', 'kia', 'lexus', 'audi', 'cadillac', 'chrysler', 'acura',
       'buick', 'mazda', 'infiniti', 'lincoln', 'volvo', 'mitsubishi', 'mini',
       'pontiac', 'rover', 'jaguar', 'porsche', 'mercury', 'saturn',
       'alfa-romeo', 'tesla', 'fiat', 'harley-davidson', 'ferrari', 'datsun',
       'aston-martin', 'land rover', 'morgan'],
      dtype='object', name='manufacturer')
It is noticed that some of the brand names have some issues¶
So replacing other names that are not related to car brands¶
In [158]:
fig, ax = plt.subplots(figsize = (15,5))
plt1 = sns.countplot(x= df_selected['manufacturer'], order = largetosmall)
plt1.set(xlabel = 'Brand', ylabel= 'Count of Cars')
xticks(rotation = 90)
plt.show()
plt.tight_layout()
<Figure size 640x480 with 0 Axes>

Lets Check Sale Trends in USA¶

Check how many car sales done in respective states¶

In [159]:
salesorder = df_selected.groupby('state').size().sort_values().index[::-1]
In [160]:
fig, ax = plt.subplots(figsize = (15,5))
plt1 = sns.countplot(x= df_selected['state'], order = salesorder)
plt1.set(xlabel = 'state', ylabel= 'Cars Sales')
xticks(rotation = 90)
plt.show()
plt.tight_layout()
<Figure size 640x480 with 0 Axes>

Let's see how price varies with Fuel Type cars¶

In [161]:
scat_df_fuel=df_selected.query('fuel=="diesel" or fuel=="electric" or fuel == "gas" or fuel == "hybrid"')
#filters out the column for any incorrect values

df_fuel_avg_price = scat_df_fuel[['fuel','price']].groupby("fuel", as_index = False).mean().rename(columns={'price':'fuel_avg_price'})
plt1 = df_fuel_avg_price.plot(x = 'fuel', kind='bar',legend = False, sort_columns = True)
plt1.set_xlabel("Fuel Type")
plt1.set_ylabel("Avg Price ($)")
xticks(rotation = 0)
plt.show()
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\3971592624.py:5: FutureWarning:

`sort_columns` is deprecated and will be removed in a future version.

Electric cars are priced more than gas and diesel cars.¶

Cost Comparision of Car Body with Average Price¶

In [162]:
scat_df_cartype=df_selected.query('type == "SUV" or type=="bus" or type=="convertible" or type=="coupe" or type=="hardtop" or type=="hatchback" or type=="mini-van" or type=="offroad" or type=="sedan" or type=="truck" or type=="van" or type=="wagon"')

df_body_avg_price = scat_df_cartype[['type','price']].groupby("type", as_index = False).mean().rename(columns={'price':'carbody_avg_price'})
plt1 = df_body_avg_price.plot(x = 'type', kind='bar',legend = False, sort_columns = True)
plt1.set_xlabel("Car Body")
plt1.set_ylabel("Avg Price ($)")
xticks(rotation = 90)
plt.show() 
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\2397231527.py:4: FutureWarning:

`sort_columns` is deprecated and will be removed in a future version.

convertible are the expensive and hatchbacks are having decent prices¶

Cost Comparision of Drive Wheel with Average Price¶

In [163]:
df_drivewheel_avg_price = df_selected[['drive','price']].groupby("drive", as_index = False).mean().rename(columns={'price':'drivewheel_avg_price'})
plt1 = df_drivewheel_avg_price.plot(x = 'drive', kind='bar', sort_columns = True,legend = False,)
plt1.set_xlabel("Drive Wheel Type")
plt1.set_ylabel("Avg Price ($)")
xticks(rotation = 0)
plt.show()
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\2732342230.py:2: FutureWarning:

`sort_columns` is deprecated and will be removed in a future version.

Cars with Rear wheel drive have a higher price value.¶

Cost Comparision of Car Dimensions with Price¶

In [164]:
scat_df_carbody=df_selected.query('carsize=="compact" or carsize == "full-size" or carsize == "mid-size" or carsize == "sub-compact"')

df_carsize = scat_df_carbody[['carsize','price']].groupby("carsize", as_index = False).mean().rename(columns={'price':'dimens_avg_price'})
plt2 = df_carsize.plot(x = 'carsize', kind='bar', sort_columns = True,legend = False,)
plt2.set_xlabel("Car Size Type")
plt2.set_ylabel("Avg Price ($)")
xticks(rotation = 0)
plt.show()
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\3044735781.py:4: FutureWarning:

`sort_columns` is deprecated and will be removed in a future version.

According to our data full-size and mid-size are having more cost than compact sizes¶

Cost Comparision of Cylinders with Average price of Cars¶

In [165]:
df_cylindernumber=df_selected.query('cylinders == "10 cylinders" or cylinders == "12 cylinders" or cylinders == "3 cylinders" or cylinders == "4 cylinders" or cylinders == "5 cylinders" or cylinders == "6 cylinders" or cylinders == "8 cylinders"')
df_cylindernumber_avg_price = df_cylindernumber[['cylinders','price']].groupby("cylinders", as_index = False).mean().rename(columns={'price':'cylindernumber_avg_price'})
plt1 = df_cylindernumber_avg_price.plot(x = 'cylinders', kind='bar', sort_columns = True,legend = False)
plt1.set_xlabel("Cylinder Number")
plt1.set_ylabel("Avg Price ($)")
xticks(rotation = 90)
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\2196906222.py:3: FutureWarning:

`sort_columns` is deprecated and will be removed in a future version.

Out[165]:
(array([0, 1, 2, 3, 4, 5, 6]),
 [Text(0, 0, '10 cylinders'),
  Text(1, 0, '12 cylinders'),
  Text(2, 0, '3 cylinders'),
  Text(3, 0, '4 cylinders'),
  Text(4, 0, '5 cylinders'),
  Text(5, 0, '6 cylinders'),
  Text(6, 0, '8 cylinders')])

Eight and twelve cylinder cars have higher price.¶

Trend of number of cars for sale according to the year they were manufactured¶

In [166]:
df_selected['year']=df_selected[df_selected.year >= 1970]['year']
sns.distplot(df_selected['year'])
C:\Users\offic\AppData\Local\Temp\ipykernel_10532\3022398356.py:2: UserWarning:



`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751


Out[166]:
<AxesSubplot: xlabel='year', ylabel='Density'>

*A decrease in trend in cars available for sale from 2009 show that fewer cars were manufactured in 2009 which can be due to the 2009 financial crisis.¶

Range of Car Prices with the Year¶

In [167]:
df=df_selected.sort_values(by=['year'],ascending=False)

plt.figure(figsize=(25,10))
sns.barplot(x=df.year, y=df.price)
plt.xticks(rotation= 90)
plt.xlabel('Year')
plt.ylabel('Price($)')
plt.show()

The above graphs shed light on the fact that due to inflation 2009 and other corona after 2019 with each passing year the average prices of cars and range of car prices increase*¶

Now From Above Breif analysis we are taking most expensive car & most sold car, comparing with their prices over the years¶

From above analysis we have achived:¶

 *Most Expensive car : Tesla
 *Most Sales happened : Ford  

Comparing Tesla Vehicles price changes over the years¶

In [168]:
scat_df_tesla=df_selected.query('manufacturer=="tesla"')
#filters out the tesla cars
scat_df_tesla=scat_df_tesla[scat_df_tesla.year>=1950]
scat_df_tesla= scat_df_tesla[scat_df_tesla.year<2023]
scat_df_tesla
Out[168]:
region price year manufacturer model condition cylinders fuel odometer title_status transmission drive carsize type state lat long posting_date geometry
1008 birmingham 29890.0 2013.0 tesla model s good 12 cylinders electric 61591.0 clean automatic rwd unknown sedan al 33.669601 -86.817617 2021-04-20T04:50:54-0500 POINT (-86.81762 33.66960)
1333 birmingham 32990.0 2013.0 tesla model s good 12 cylinders electric 69961.0 clean automatic rwd unknown sedan al 33.669601 -86.817617 2021-04-14T04:50:50-0500 POINT (-86.81762 33.66960)
1357 birmingham 38750.0 2018.0 tesla model 3 good 12 cylinders electric 14452.0 clean automatic rwd unknown hardtop al 33.499500 -86.726800 2021-04-13T09:37:20-0500 POINT (-86.72680 33.49950)
1806 birmingham 29890.0 2013.0 tesla model s good 12 cylinders electric 61591.0 clean automatic rwd unknown sedan al 33.669601 -86.817617 2021-04-04T10:01:09-0500 POINT (-86.81762 33.66960)
2755 huntsville / decatur NaN 2015.0 tesla model s good 12 cylinders electric 23906.0 clean automatic rwd unknown sedan al 36.138037 -86.731163 2021-04-30T12:13:15-0500 POINT (-86.73116 36.13804)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
424599 milwaukee NaN 2020.0 tesla model 3 long range sedan good 12 cylinders electric 3996.0 clean other rwd unknown sedan wi 43.060000 -87.960000 2021-04-18T08:02:43-0500 POINT (-87.96000 43.06000)
425311 sheboygan 36590.0 2019.0 tesla model 3 standard range good 12 cylinders electric 23057.0 clean other rwd unknown sedan wi 43.700000 -87.740000 2021-04-30T11:51:04-0500 POINT (-87.74000 43.70000)
425611 wausau 37990.0 2019.0 tesla model 3 standard range good 12 cylinders electric 22544.0 clean other rwd unknown sedan wi 44.960000 -89.640000 2021-05-03T15:11:06-0500 POINT (-89.64000 44.96000)
425747 wausau 32990.0 2013.0 tesla model s signature good 12 cylinders electric 62588.0 clean other rwd unknown sedan wi 44.960000 -89.640000 2021-04-27T13:51:11-0500 POINT (-89.64000 44.96000)
425794 wausau 35990.0 2019.0 tesla model 3 standard range good 12 cylinders gas 31540.0 clean other rwd unknown sedan wi 44.960000 -89.640000 2021-04-26T12:21:09-0500 POINT (-89.64000 44.96000)

868 rows × 19 columns

we are filtering only the tesla vehicles and comparing with its price and year¶

In [169]:
plt.figure(figsize=(25,10))
sns.barplot(x=scat_df_tesla.year, y=scat_df_tesla.price)
plt.xticks(rotation= 90)
plt.xlabel('Year')
plt.ylabel('Price($)')
plt.show()
C:\Users\offic\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\algorithms.py:98: RuntimeWarning:

Mean of empty slice

C:\Users\offic\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\algorithms.py:98: RuntimeWarning:

Mean of empty slice

Comparing Ford Vehicles price changes over the years¶

In [170]:
scat_df_ford=df_selected.query('manufacturer == "ford"')
#filters out the year column for any incorrect values
scat_df_ford=scat_df_ford[scat_df_ford.year>=1990]
scat_df_ford= scat_df_ford[scat_df_ford.year<2023]
scat_df_ford
Out[170]:
region price year manufacturer model condition cylinders fuel odometer title_status transmission drive carsize type state lat long posting_date geometry
31 auburn 15000.0 2013.0 ford f-150 xlt excellent 6 cylinders gas 128000.0 clean automatic rwd full-size truck al 32.592000 -85.518900 2021-05-03T14:02:03-0500 POINT (-85.51890 32.59200)
44 auburn 30990.0 2019.0 ford ranger supercrew xl pickup good 12 cylinders other 1834.0 clean other rwd unknown convertible al 32.590000 -85.480000 2021-05-01T09:21:34-0500 POINT (-85.48000 32.59000)
47 auburn 34590.0 2018.0 ford f150 super cab xl pickup 4d good 6 cylinders gas 20856.0 clean other rwd unknown convertible al 32.590000 -85.480000 2021-04-30T12:51:31-0500 POINT (-85.48000 32.59000)
50 auburn 38990.0 2020.0 ford f150 supercrew cab xlt good 6 cylinders gas 12231.0 clean other rwd unknown convertible al 32.590000 -85.480000 2021-04-30T10:11:00-0500 POINT (-85.48000 32.59000)
53 auburn 27990.0 2020.0 ford ranger supercab xl pickup good 12 cylinders gas 10688.0 clean other rwd unknown convertible al 32.590000 -85.480000 2021-04-29T18:11:11-0500 POINT (-85.48000 32.59000)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
426791 wyoming 32995.0 2013.0 ford f-350 super duty 4x2 2dr re good 12 cylinders gas 51000.0 clean automatic rwd unknown hardtop wy 36.114900 -115.216100 2021-04-07T10:05:03-0600 POINT (-115.21610 36.11490)
426792 wyoming 32999.0 2014.0 ford f350, xlt excellent 8 cylinders diesel 154642.0 clean automatic awd full-size convertible wy 41.138284 -104.784799 2021-04-07T09:06:54-0600 POINT (-104.78480 41.13828)
426813 wyoming 1000.0 1998.0 ford explorer good 6 cylinders gas 190000.0 clean automatic rwd unknown hardtop wy 41.183600 -104.802300 2021-04-06T15:40:45-0600 POINT (-104.80230 41.18360)
426817 wyoming 20590.0 2018.0 ford edge se sport utility 4d good 12 cylinders gas 37164.0 clean other fwd unknown SUV wy 33.786500 -84.445400 2021-04-06T03:50:18-0600 POINT (-84.44540 33.78650)
426844 wyoming 13977.0 2014.0 ford flex good 12 cylinders gas 112061.0 clean automatic rwd unknown SUV wy 45.762900 -108.539000 2021-04-05T10:42:43-0600 POINT (-108.53900 45.76290)

68486 rows × 19 columns

we are filtering only the ford vehicles and comparing with its price and year¶

In [171]:
plt.figure(figsize=(25,10))
sns.barplot(x=scat_df_ford.year, y=scat_df_ford.price)
plt.xticks(rotation= 90)
plt.xlabel('Year')
plt.ylabel('Price($)')
plt.show()

Comparing Odometer distance travelled with respective Cars¶

In [172]:
columns_od = 'manufacturer == "ford" or manufacturer == "chevrolet" or manufacturer == "toyota" or manufacturer == "honda" or manufacturer == "nissan" or manufacturer == "jeep" or manufacturer == "ramtrucks" or manufacturer == "gmc" or manufacturer == "bmw" or manufacturer == "dodge" or manufacturer == "mercedes-benz" or manufacturer == "hyundai" or manufacturer == "subaru" or manufacturer == "volkswagen" or manufacturer == "kia" or manufacturer == "lexus" or manufacturer == "audi" or manufacturer == "cadillac" or manufacturer == "chrysler" or manufacturer == "acura" or manufacturer == "buick" or manufacturer == "mazda" or manufacturer == "infiniti" or manufacturer == "lincoln" or manufacturer == "volvo" or manufacturer == "mitsubishi" or manufacturer == "mini" or manufacturer == "pontiac" or manufacturer == "rover" or manufacturer == "jaguar" or manufacturer == "porsche" or manufacturer == "mercury" or manufacturer == "saturn" or manufacturer == "alfa-romeo" or manufacturer == "tesla" or manufacturer == "fiat" or manufacturer == "harley-davids" or manufacturer == "ferrari" or manufacturer == "datsun" or manufacturer == "aston-martin" or manufacturer == "landrover" or manufacturer == "morgan3"'
odometer_df = df_selected.query(columns_od)

plt.figure(figsize=(25,10))
sns.barplot(x=odometer_df.manufacturer, y=odometer_df.odometer)
plt.xticks(rotation= 90)
plt.xlabel('Manufacturer')
plt.ylabel('Odometer')
plt.show()

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------¶

Bivariate Analysis¶

Top 10 Most Driven Cars Considering Condition,Model,Price,Brand, Model Year, Distance¶

In [173]:
scat_df=df_selected.query('manufacturer=="ford" or manufacturer=="bmw" or manufacturer=="chevrolet" or manufacturer=="mercedes-benz" or manufacturer=="audi" or manufacturer=="jeep" or manufacturer=="toyota" or manufacturer=="tesla" or manufacturer=="volkswagen" or manufacturer=="lexus"')
#filters out the year column for any incorrect values
scat_df=scat_df[scat_df.year>=1950]
scat_df= scat_df[scat_df.year<2023]
In [174]:
#sets upper and lower limits for price column to eliminate incorrect entries
scat_df= scat_df[scat_df.price>10]
scat_df= scat_df[scat_df.price<40000]
#sets upper and lower limits for odometer column to eliminate incorrect entries
scat_df=scat_df[scat_df.odometer<=400000]
scat_df=scat_df[scat_df.odometer>=100]
In [175]:
#takes sample of a 20000 values from the obtained dataframe for the scatter plot

scat_sample=scat_df.sample(20000)

scat_plot1=px.scatter(scat_sample,x='price',y='odometer',color='manufacturer',hover_data=['year','model','condition'],template='plotly_dark',
                      title="Price-Distance Driven",color_discrete_sequence=px.colors.qualitative.Bold[3:7],
                labels={
                     "price": "Price($)",
                     "odometer": "Distance Driven(Miles)",
                     "year":"Year",
                     "model":"Model",
                     "condition":"Condition",
                     "manufacturer":"Manufacturer"
                                    })
scat_plot1.show()
In [176]:
#takes sample of a 3000 values from the obtained dataframe for the scatter plot

scat_sample2=scat_df.sample(3000)

scat_plot2=px.scatter(scat_sample2,x='price',y='odometer',color='manufacturer',hover_data=['year','model','condition'],template='plotly_dark',
                      title="Price-Distance Driven",color_discrete_sequence=px.colors.qualitative.Bold[3:7],
                labels={
                     "price": "Price($)",
                     "odometer": "Distance Driven(Miles)",
                     "year":"Year",
                     "model":"Model",
                     "condition":"Condition",
                     "manufacturer":"Manufacturer"
                                    })
scat_plot2.show()

Comparing and checking Most Distance driven cars with fuel type¶

Top most driven cars¶

In [177]:
plt.figure(figsize=(25,10))
scat_df_fuel=df_selected.query('manufacturer=="ford" or manufacturer=="bmw" or manufacturer=="chevrolet" or manufacturer=="mercedes-benz" or manufacturer=="audi" or manufacturer=="jeep" or manufacturer=="toyota" or manufacturer=="tesla" or manufacturer=="volkswagen" or manufacturer=="lexus"')

scat_df_fuel=scat_df_fuel[scat_df_fuel.year>=1950]
scat_df_fuel= scat_df_fuel[scat_df_fuel.year<2023]
#sets upper and lower limits for price column to eliminate incorrect entries
scat_df_fuel= scat_df_fuel[scat_df_fuel.price>10]
scat_df_fuel= scat_df_fuel[scat_df_fuel.price<40000]
#sets upper and lower limits for odometer column to eliminate incorrect entries
scat_df_fuel=scat_df_fuel[scat_df_fuel.odometer<=400000]
scat_df_fuel=scat_df_fuel[scat_df_fuel.odometer>=100]
sns.lineplot(x = "year", y = "odometer", data = scat_df_fuel, hue = "manufacturer")
plt.show()

Comparing with every car Brands¶

In [180]:
plt.figure(figsize=(35,10))
scat_df_fuel2=df_selected.query('year >= 1970')
scat_df_fuel2=scat_df_fuel2[scat_df_fuel2.year>=1970]
scat_df_fuel2= scat_df_fuel2[scat_df_fuel2.year<2023]
#sets upper and lower limits for price column to eliminate incorrect entries
scat_df_fuel2= scat_df_fuel2[scat_df_fuel2.price>10]
scat_df_fuel2= scat_df_fuel2[scat_df_fuel2.price<40000]
#sets upper and lower limits for odometer column to eliminate incorrect entries
scat_df_fuel2=scat_df_fuel2[scat_df_fuel2.odometer<=400000]
scat_df_fuel2=scat_df_fuel2[scat_df_fuel2.odometer>=100]
sns.lineplot(x = "year", y = "odometer", data = scat_df_fuel2, hue = "manufacturer")
plt.show()

By Above Analysis:¶

  • Ford,Chevrolet and audi has most Distance driven from the year 2000 to 2016
  • from 2015 Tesla has its highest Distance driven and become constant.
  • after 2019 due to pandemic every car has sightly decreased.

Cars Sales Analysis using Geo Location¶

In [184]:
from shapely.geometry import Point
import geopandas as gpd
from geopandas import GeoDataFrame
geo_df = df_selected.query('state == "ks" or state == "wy" or state == "tn" or state == "nv" or state == "mt"')
geometry = [Point(xy) for xy in zip(geo_df['long'], geo_df['lat'])]
gdf = GeoDataFrame(geo_df, geometry=geometry) 

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
#Dropping Anatartica as we are not using

world = world[(world.name != "Antarctica") & (world.name != "Fr. S. Antarctic Lands")]

gdf.plot(ax=world.plot(figsize=(10, 6),cmap='Greys',edgecolor='black'), facecolor='white', color='red', markersize=15);

Brand - Year -Condition¶

In [183]:
plt.figure(figsize=(25,10))
sns.violinplot(x=df_selected.year, y=df_selected.condition)
plt.xticks(rotation= 90)
plt.xlabel('Year')
plt.ylabel('Condition')
plt.show()

Violin plot shows that older the car worse its condition .Car condition gets better as we move to newer model.¶

What we Have Done¶

In [182]:
auto = df_selected[['fuel', 'type', 'drive', 'carsize',
       'cylinders','price', 'odometer']]

Visualising all Graphs with create price columns above¶

In [181]:
plt.figure(figsize=(18, 20))
plt.subplot(4,2,1)
scat_df_fuel=df_selected.query('fuel=="diesel" or fuel=="electric" or fuel == "gas" or fuel == "hybrid"')

sns.boxplot(x = 'fuel', y = 'price', data = scat_df_fuel)
plt.subplot(4,2,2)
sns.boxplot(x = 'cylinders', y = 'price', data = df_selected)
plt.subplot(4,2,3)
sns.boxplot(x = 'type', y = 'price', data = df_selected)
plt.subplot(4,2,4)
sns.boxplot(x = 'drive', y = 'price', data = df_selected)
plt.subplot(4,2,5)
sns.boxplot(x = 'carsize', y = 'price', data = df_selected)
plt.subplot(4,2,6)
sns.boxplot(x = 'condition', y = 'price', data = df_selected)
plt.tight_layout()
plt.show()

summary¶

Data Preprocessing with pysprk¶

Data Preparation and Cleaning¶

Data Analysis & Visualization¶

Univariate Analysis¶

 * Car price vs Brand
 * Top Manufactures(Count of cars vs Brand)
 * Top Sale Trends with state(Car sales vs States)
 * Fuel Type vs Price
 * car body vs price
 * drive wheel vs price
 * car dimensions(overall-size) vs price
 * cylinders vs price
 * odometer vs price

Bivariate Analysis¶

 * Top 10 Most Driven Cars Considering Condition,Model,Price,Brand, Model Year, Distance
 * Brand - Year -Condition

Libraries used¶

 *numpy, pandas, pyspark, matplotlib, seaborn, folium, opendatasets, geopandas, ploty, xticks, pyplot, bucket

Graphs used¶

 *bar, barh, countplot, seaborn barplot, scatter plot, violin plot,box plot,lineplot
In [ ]:
 
In [ ]:
 
In [ ]: